Inferring potential non-disclosed men who have sex with men among self-reported heterosexual men with HIV in Southwest China: A genetic network study

Background
In Guangxi province of China, there is a high prevalence of HIV in the general population and in men who have sex with men (MSM). However, MSM account for only a low proportion of people living with HIV. This apparent contradiction could be due to a high proportion of potential non-disclosed MSM (pnMSM) among reported heterosexual men. We analyzed the genetic linkage of HIV specimens to address this problem, aiming to (1) identify the optimal genetic distance threshold, which gave the highest number of genetic clusters, (2) document the proportion of pnMSM among self-reported heterosexual men, and (3) determine predictors for pnMSM.
Methods
Pairwise genetic distances were computed among all samples. The genetic distance threshold giving the highest number of genetic clusters was identified. Self-reported heterosexual men were identified as belonging to the pnMSM group if they could be linked to any MSM in their cluster. Multinomial logistic regression was used to determine factors associated with being pnMSM.
Results
The optimal genetic distance threshold was 0.75% substitutions/site. Among 896 self-reported heterosexual men, the frequency (percentage and 95% confidence interval) was 62 (6.9%, 5.2–8.6%) for pnMSM, 779 (86.9%, 84.7–89.1%) for indeterminate men and 55 (6.1%, 4.5–7.7%) for potential heterosexual men. Self-reported heterosexual men who were younger, single and more educated were more likely to be pnMSM.
Conclusion
Based on these findings, there is a need to pay more attention to sexually active, young and educated self-reported heterosexual men and provide them with voluntary counselling and testing and specific HIV prevention services.


Introduction
Guangxi is a province in south-western China and has the third highest number of HIV/AIDS cases reported at the provincial level [1]. The prevalence of HIV among men who have sex with men (MSM) reported from the Integrated Biologic and Behavioral Surveys (IBBS) increased from 0.83% in 2008 to 11.2% in 2015 [2] and then stabilized at around 10% in 2017. However, MSM contributed only 3.13% of all reported HIV/AIDS cases during 2010-2017 [3], a share that rose to 7.0% in 2020. These percentages were relatively low compared with those of neighboring provinces, which ranged from 14.4% in Chongqing to 43.5% in Hainan.
Although same-sex behaviour is not illegal in China, being homosexual or associating with MSM has negative social and cultural consequences, including rejection by family members and loss of employment [4]. Hence, stigma towards homosexual identity is common in China [5,6]. Many HIV-infected MSM in China might self-report as heterosexual men to avoid stigmatization [7], which could lead to an underestimation of the MSM population size. MSM who conceal their sexual identity would also miss the opportunity to access beneficial services such as MSM-specific education packages and pre- and post-exposure prophylaxis [8].
With HIV genetic sequencing technology, pairwise genetic distances of two or more persons can be determined. With an optimal genetic distance threshold, HIV genetic clusters (or networks) can be constructed and two network members with a genetic distance less than this threshold will form a linkage. Previous studies identified potential non-disclosed men who have sex with men (pnMSM) based on whether the clusters contained MSM and self-reported heterosexual men only [9,10]. An alternative method to determine pnMSM is to use the genetic linkage, which may give a higher precision of possible transmission.
A genetic linkage between two HIV-infected people strongly implies HIV transmission between them. Furthermore, if the two people are not intravenous drug users, the linkage may imply sexual contact. Thus, if a man who reports himself as heterosexual is genetically linked to a man who has sex with other men, then he can be considered to belong to the MSM group. The aims of this study were (1) to use pol-region protease and reverse transcriptase (PR-RT) nucleotide sequence information of HIV samples obtained from self-reported heterosexual men, MSM, and women to identify the optimal genetic distance threshold, which results in the highest number of clusters (networks), (2) to document the proportion of pnMSM among self-reported heterosexual men, and (3) to determine predictors for pnMSM.
Our study provides a method to estimate the proportion of pnMSM among self-reported heterosexual men in any community where such a proportion needs to be documented.

PLOS ONE
Inferring potential non-disclosed men who have sex with men among self-reported heterosexual men with HIV

Ethics statement
Informed consent was waived since we used leftover specimens confirmed as HIV positive by the Western Blot test for phylogenetic analysis. The confidentiality of all study participants was maintained and their well-being was not affected. Participants' epidemiological data were obtained by matching cases with the Guangxi HIV/AIDS case report dataset through the unique ID of the specimens. The study was approved by the ethics committee of Prince of Songkla University (REC. 63-348-18-1) and the institutional review board of the Guangxi Centre for Disease Prevention and Control (GXIRB2020-0069).

Study setting
The study was conducted in Nanning, the capital city of Guangxi. The number of reported HIV/AIDS cases during 2016-2020 was 8463, of whom 7696 (90.9%) were reported as being infected through sexual transmission. Cases attributed to injecting drug use accounted for less than 2.0% of all reported cases and were therefore excluded from the analysis. Among the sexual transmission cases, 6312 (82.0%) were self-reported heterosexuals and 1384 (18.0%) were identified as MSM.

Study design and study subjects
This was a cross-sectional HIV genetic-network study. In order to achieve the required sample size, we combined HIV/AIDS treatment-naïve cases diagnosed during 2016-2020 aged 18 years or older from two major types of service sites. One was from local hospitals with routinely detected HIV/AIDS cases. Most of these persons were self-reported heterosexuals. The other was from a voluntary counselling and testing clinic under the jurisdiction of the Guangxi Centre for Disease Control and Prevention where the majority of clients were MSM. These two sites combined were expected to cover over 90.0% of the HIV/AIDS reported cases in Nanning. We used leftover specimens of HIV Western Blot confirmatory test to do the genetic analysis. After removing 127 duplicates based on linkage of HIV sequence ID and citizen ID, which was conducted by other members outside of our study group, 1047 samples from the hospitals and 928 samples from the clinic were included in the phylogenetic analysis. For HIV sequencing, the pol fragments (HXB2 positions 2253-3870, minimum length 900 bp) were amplified and sequenced using an in-house polymerase chain reaction protocol according to previously published methods [11,12]. Variables included in the analysis were age at HIV diagnosis, gender, self-reported HIV risk category, education level, marital status, occupation, ethnicity, CD4 count at diagnosis, and year of HIV diagnosis.
The Tamura-Nei 93 (TN93) nucleotide substitution model was applied to calculate the pairwise genetic distances for all 1975 sequences, and networks were evaluated across different thresholds (0.1% to 2.0%) in order to determine the optimal threshold, at which the molecular network resolution is highest and the number of clusters is at its maximum. This principle is outlined in a previous study [14] and has been applied in several recent studies [15,16]. The transmission networks were reconstructed using the HIV Transmission Cluster Engine (HIV-TRACE, http://demo.hivtrace.org/network.html), which can identify groups of putative transmission partners and assemble them in clusters regardless of HIV subtypes, potentially identifying more accurate transmission chains [17,18]. We resolved all two-way ambiguities to match their possible single character states, averaged all other ambiguities, ignored positions where either sequence had a gap, and, for ambiguity-rich sequences (>5% of bases ambiguous), averaged all resolutions [19].
The transmission clusters were used to identify pnMSM, which is the focus of this paper. We omitted the display and discussion of the phylogenetic tree to make this paper more concise.

Identification of pnMSM
Cluster-based classification method. The cluster-based classification method has been described elsewhere [9,10]. Briefly, the method first chooses an optimal genetic distance threshold to group the sample into clusters whereby a member must be genetically close to at least one other member with a distance shorter than the threshold. A self-reported heterosexual man is classified as pnMSM if all other members of his cluster are MSM or self-reported heterosexual men. A singleton is a person whose virus cannot be linked to any other person's based on the genetic distance threshold.
Linkage-based classification method. The linkage-based method classifies a self-reported heterosexual man based on the genetic distance between his virus and those of others in the study sample. If the genetic distance between two persons' viruses is not larger than the pre-set threshold, a transmission between them is considered probable. Persons whose viruses are this close genetically can form a cluster or a network. Table 1 shows the criteria for classifying self-reported heterosexual men according to seven types of linkages between them and other members of the network. If a self-reported heterosexual man had any direct link with MSM, he was classified as pnMSM, as we assumed that most transmission was by the sexual route. If his virus was genetically linked only with women, he was classified as "Potential heterosexual men". Otherwise, he was classified as "Indeterminate".
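The cluster-based rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the group labels "HET", "MSM" and "WOMAN", the function names and the toy data are hypothetical, and the sketch additionally assumes that a cluster must contain at least one MSM before its self-reported heterosexual men are counted as pnMSM.

```python
from collections import defaultdict

def find_clusters(nodes, links):
    """Group nodes into connected components (clusters) with union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for n in nodes:
        clusters[find(n)].add(n)
    return list(clusters.values())

def cluster_based_pnmsm(groups, links):
    """Return self-reported heterosexual men whose cluster holds only MSM and such men."""
    pnmsm = set()
    for members in find_clusters(groups, links):
        labels = {groups[m] for m in members}
        # Assumption: the cluster must include at least one MSM and no women.
        if "MSM" in labels and labels <= {"MSM", "HET"}:
            pnmsm |= {m for m in members if groups[m] == "HET"}
    return pnmsm

# Hypothetical toy data: h2 is linked to an MSM but shares his cluster with a woman.
groups = {"h1": "HET", "m1": "MSM", "h2": "HET", "m2": "MSM", "w1": "WOMAN", "h3": "HET"}
links = [("h1", "m1"), ("h2", "m2"), ("h2", "w1")]
print(cluster_based_pnmsm(groups, links))
```

In this toy network only h1 qualifies: h2 is excluded because his cluster contains a woman, and h3 is a singleton.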
Self-reported heterosexual men could appear in a network together with more than one other subgroup, regardless of whether the linkage-based or the cluster-based method was used. However, the linkage-based method could identify pnMSM within mixed groups of "self-reported heterosexual men, women and MSM" based on direct linkage, whereas the cluster-based method could not, because it did not consider any cluster containing a woman.

Table 1. Classification of self-reported heterosexual men based on who they were linked to in the same network. (Columns: Who they linked to; Classification.)

To make the results comparable with the cluster-based method, the threshold for the genetic distance was set to the same value. Thus, the two methods shared the same set of singletons.
Classification process. According to the classification criteria, we classified self-reported heterosexual men repeatedly over many rounds.
In the first round, we defined self-reported heterosexual men as pnMSM in the original network based on their linkage with any MSM. In the second and further rounds, we reclassified a self-reported heterosexual man into the pnMSM group if he could be linked with pnMSM identified from a previous round. This process continued until there were no further changes.
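The round-by-round procedure above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the labels "HET", "MSM" and "WOMAN", the function name and the toy network are hypothetical, and the sketch encodes only the three rules stated in the text rather than the full seven linkage types of Table 1.

```python
from collections import defaultdict

def linkage_based_classify(groups, links):
    """Classify each self-reported heterosexual man by his direct links.

    Returns {man: "pnMSM" | "potential heterosexual" | "indeterminate"}.
    """
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    het_men = {n for n, g in groups.items() if g == "HET"}
    msm_like = {n for n, g in groups.items() if g == "MSM"}
    pnmsm = set()

    # Repeat rounds until no further man is reclassified as pnMSM.
    changed = True
    while changed:
        newly = {h for h in het_men - pnmsm if neighbours[h] & msm_like}
        pnmsm |= newly
        msm_like |= newly  # pnMSM are treated as MSM in the next round
        changed = bool(newly)

    result = {}
    for h in het_men:
        linked_groups = {groups[n] for n in neighbours[h]}
        if h in pnmsm:
            result[h] = "pnMSM"
        elif linked_groups == {"WOMAN"}:
            result[h] = "potential heterosexual"
        else:
            result[h] = "indeterminate"  # singletons and all other link patterns
    return result

# Hypothetical toy network: h4 is linked only to h1, who is linked to an MSM.
groups = {"h1": "HET", "m1": "MSM", "h2": "HET", "m2": "MSM", "w1": "WOMAN",
          "h3": "HET", "h4": "HET", "h5": "HET", "w2": "WOMAN"}
links = [("h1", "m1"), ("h2", "m2"), ("h2", "w1"), ("h4", "h1"), ("h5", "w2")]
print(linkage_based_classify(groups, links))
```

Here h1 and h2 become pnMSM in the first round, h4 in the second round through his link to h1, h5 is a potential heterosexual man, and the singleton h3 stays indeterminate.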
We compared the results from the linkage-based method with those from the cluster-based method to examine the extent of over- and under-estimation.

Genetic network inference based on optimal genetic distance threshold
Both the cluster-based and linkage-based methods attempt to identify persons in clusters. The precision of the cluster definition depends on the genetic distance threshold: the shorter the distance, the stronger the evidence of transmission. Setting a strict (very short distance) threshold results in more precise clusters but a higher number of singletons, and an excessive number of singletons is not useful for identifying pnMSM. Relaxing the threshold, on the other hand, results in fewer singletons but less precise clusters. The choice of threshold therefore strongly affects the cluster-based method: as clusters grow larger, the chance of linking a heterosexual man into an MSM cluster increases.
The optimal threshold was taken as the one giving the maximum number of clusters, which avoids both an excessive number of singletons and excessively large clusters. We performed the network construction iteratively under different genetic distance thresholds (ranging from 0.001 to 0.02 substitutions/site) and plotted the threshold against the number of clusters and links produced to determine the optimal threshold value.
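The threshold sweep can be illustrated with a short Python sketch. It assumes that pairwise distances are already available as a dictionary, that two sequences are linked when their distance is not larger than the threshold, and that ties are broken in favour of the smallest threshold. The function names and the toy distances are hypothetical and do not come from the study data.

```python
def count_clusters(distances, threshold):
    """Count clusters (connected groups of at least two members) at a threshold.

    distances maps a pair (a, b) to the pairwise genetic distance.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), d in distances.items():
        if d <= threshold:
            parent[find(a)] = find(b)

    # Only linked sequences enter `parent`, so every root is a cluster of size >= 2.
    return len({find(x) for x in list(parent)})

def optimal_threshold(distances, thresholds):
    """Return the threshold with the most clusters (first one wins on ties)."""
    return max(thresholds, key=lambda t: count_clusters(distances, t))

# Hypothetical pairwise distances in substitutions/site.
distances = {("a", "b"): 0.004, ("c", "d"): 0.006, ("e", "f"): 0.007,
             ("b", "c"): 0.009, ("d", "e"): 0.014}
thresholds = [0.002, 0.005, 0.0075, 0.01, 0.015]
for t in thresholds:
    print(t, count_clusters(distances, t))
print("optimal:", optimal_threshold(distances, thresholds))
```

In the toy data the cluster count first rises as singletons join pairs and then falls as clusters merge, which is the pattern the plotted curve is used to detect.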

Statistical analysis
Demographic variables were presented descriptively using frequency and percentage. The distribution of cluster types across the nodes was analyzed using the R language and environment [20]. 95% confidence intervals for proportions were calculated as $p \pm 1.96\sqrt{pq/n}$, where p is the proportion of pnMSM among self-reported heterosexual men with HIV, q is 1-p, and n is the total number of self-reported heterosexual men with HIV in our study. Factors associated with being pnMSM were determined using a multinomial logistic regression model with three outcome groups, namely pnMSM, potential heterosexual men and indeterminate men. We chose pnMSM as the referent group against which each of the two other groups was compared. If the degree and direction of the association between pnMSM and both groups were consistent, the predictors identified would be reasonably robust. Note that with pnMSM as the referent group, a relative risk ratio >1 indicates that the independent variable predicts non-pnMSM status. The interpretation is therefore the reverse of the conventional reading of an odds ratio (>1 is a risk factor, <1 is a protective factor): a predictor with a relative risk ratio <1 is a positive predictor of being pnMSM.
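As a worked check of the confidence interval, the sketch below assumes the usual normal-approximation (Wald) form, p ± 1.96·sqrt(pq/n), and applies it to the counts reported in this paper (62, 779 and 55 out of 896). The function name is hypothetical. The results agree with the reported intervals to within rounding.

```python
import math

def wald_ci(count, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    p = count / n
    q = 1 - p
    half_width = z * math.sqrt(p * q / n)
    return p - half_width, p + half_width

n_het_men = 896  # self-reported heterosexual men in the study
for label, count in [("pnMSM", 62), ("indeterminate", 779), ("potential heterosexual", 55)]:
    low, high = wald_ci(count, n_het_men)
    print(f"{label}: {count / n_het_men:.1%} ({low:.1%} to {high:.1%})")
```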

Results
Basic characteristics of the study samples under optimal genetic distance threshold
S1 Table illustrates the extent of mixture for subjects recruited from the two kinds of study sites and indicates that subjects were well distributed throughout the phylogenetic tree.
S2 Table shows the risk-gender distribution of all subjects from the two kinds of study sites mixed together. The proportion of MSM overall was 33.3%.
Out of 7,696 HIV/AIDS cases diagnosed during the study period and self-reported as being infected with HIV via sexual contact, 1,975 (25.7%) had leftover specimens of adequate quantity and acceptable quality for RNA sequencing. Table 2 compares characteristics of the study subjects. Compared to self-reported heterosexual men, MSM were younger, more educated, and had a higher proportion of singles and members of the Han ethnic group. CRF07_BC was the predominant HIV-1 subtype among MSM, whereas CRF01_AE was more common among self-reported heterosexual men. A low CD4+ count was more common among self-reported heterosexual men than among MSM.
According to the criteria in Table 1, there were 782 indeterminate men classified in round 1, of whom 517 were singletons. In the second round, the pnMSM classified in round 1 were assumed to be MSM and the criteria were reapplied. Three indeterminate men from the first round were reclassified as pnMSM due to their linkage with newly identified pnMSM. Repeating this process for a third round gave no change in the results and the process ended. Thus, eventually, there were 62 pnMSM, 55 potential heterosexual men and 779 indeterminate men. The corresponding proportions (95% confidence intervals) were 6.9% (5.2-8.6%), 6.1% (4.5-7.7%) and 86.9% (84.7-89.1%), respectively.

Genetic network visualization
To explore the connections between self-reported heterosexual men and other nodes, especially with MSM, we visualized the genetic network with clusters containing self-reported heterosexual men and related nodes.

Singletons were omitted from the network visualization (Fig 2). Table 3 compares the numbers of pnMSM identified by the linkage-based and cluster-based methods. The linkage-based method gave a higher number of pnMSM for all genetic distance thresholds. S3 Table shows the details of the comparison of the number and proportion of pnMSM identified from self-reported heterosexual men by the two methods at all thresholds (ranging from 0.005 to 0.015 substitutions/site).

Comparison between MSM and pnMSM
S4 Table compares characteristics between MSM and pnMSM identified by the linkage-based classification method. The pnMSM group were generally older, less educated, more likely to be married, and had lower CD4+ counts. In general, their background characteristics were in between those of self-reported heterosexual men and MSM.
Table 4 summarizes the multinomial regression model used to determine factors associated with pnMSM. Since pnMSM was the baseline outcome group, relative risk ratios in this table are interpreted in reverse relative to a conventional case-control study: a relative risk ratio less than 1 indicates that persons at that level of the independent variable were more likely to be pnMSM. The relative risk of being pnMSM increased with decreasing age group. Using <40 years old as the referent group, subjects in the oldest age group were less likely to be classified as pnMSM. Similarly, other predictors for pnMSM included being single and having a higher educational background. Other variables were not statistically significant.
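The reversed reading of the relative risk ratio follows from the form of the multinomial logit model. As a sketch, with a single predictor x and coefficient symbols that are assumptions introduced here (they do not appear in the paper), the model with pnMSM as the referent outcome can be written as:

```latex
\ln\frac{P(Y = k \mid x)}{P(Y = \text{pnMSM} \mid x)} = \beta_{k0} + \beta_{k1}\,x,
\qquad k \in \{\text{potential heterosexual},\ \text{indeterminate}\},
\qquad \mathrm{RRR}_k = e^{\beta_{k1}}
```

An RRR below 1 for a given level of x means that this level lowers the probability of belonging to group k relative to pnMSM, which is the same as saying that men at this level are relatively more likely to be pnMSM.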

Discussion
With an optimal genetic distance threshold of 0.0075 substitutions/site, nearly 7% of the self-reported heterosexual men in this study were classified as pnMSM while almost 87% were indeterminate. The number of pnMSM identified by the linkage-based method was higher than that identified by the cluster-based method. The pattern of differences between pnMSM and potential heterosexual men among self-reported heterosexual men was similar to that between MSM and self-reported heterosexual men among HIV-positive men. MSM and pnMSM tended to be single, younger and more educated than the heterosexual men.
The HIV epidemic in Guangxi province is driven by heterosexual transmission [1,3]. The basic characteristics of self-reported heterosexual men were consistent with previous studies in Guangxi [3,21]. The low CD4 count at diagnosis among self-reported heterosexual men indicates that they were diagnosed at a relatively late stage and points to a problem of access to care among older men who were infected through heterosexual contact. Previous studies found that older heterosexual men in Guangxi tended to have high-risk sexual contact with sex workers, which led to their HIV infection [22,23]. Yet they were rarely screened for HIV and were not diagnosed until they became seriously ill in hospital. Our study suggests that this population needs education on HIV prevention and early counselling and testing to prevent HIV infection and its complications.
Genetic distance reflects how long ago two viruses shared a common ancestor. A shorter distance threshold allows only recent transmissions to be detected; most cases are then classified as singletons, and less recent transmissions go undetected. On the other hand, a relaxed threshold allows past transmissions to be included in the clusters, which can dilute recent transmissions, the ones most urgent to control. We chose the genetic distance threshold that gave the maximum number of clusters, which offered a good balance between these two concerns. A distance of 0.75% substitutions/site is equivalent to 3.5-4 years of HIV evolution [24]. Thus, transmission among the linked cases was moderately recent.
Theoretically, the linkage-based method is expected to identify more pnMSM than the cluster-based method. This is because, under the cluster-based method, the mere presence of a woman in a cluster could, perhaps incorrectly, prevent a self-reported heterosexual man from being classified as pnMSM even if the woman had no genetic linkage with him and he was actually linked with MSM in the same cluster.
We found that pnMSM were significantly younger and better educated than the other groups, and included more single (unmarried) subjects than potential heterosexual men. This is consistent with the characteristics of registered MSM. Older men therefore appeared less likely to have same-sex practice. The difference in risk across age groups may reflect a birth cohort effect in which younger generations of men are more likely to engage in same-sex activity, whereas the older generation is more accustomed to heterosexual sex with commercial sex workers. Non-disclosure of sexual status may be due to stigma [7], which may lead to delays in receiving counselling and HIV testing. Newly developed HIV self-testing technology should be made available to this high-risk group while they are not ready to present themselves to counselling and testing services. Legal measures are also needed to oppose homophobia and social discrimination against people with HIV infection [25]. These efforts will encourage pnMSM to disclose their same-sex behaviors and seek pre- and post-exposure prophylaxis more freely.
There are several limitations in this study. The major one was the adequacy of the leftover specimens. The missing cases could have led to incomplete analysis of the genetic linkages and to a high number of singletons, which were classified as indeterminate men. Secondly, missing cases who were men would also reduce the chance of having men in a cluster and thus bias the classification of the non-missing self-reported heterosexual men away from pnMSM. Thirdly, the indeterminate men who were linked to both men and women would have a higher likelihood of being pnMSM than those who were singletons. All of these limitations might lead to underestimation of pnMSM in the population. On the other hand, because risk categories were based on self-report, possible incomplete exclusion of drug users among the study men might lead to over-reporting of pnMSM cases.

Conclusions
Based on our findings, there is a need to pay more attention to sexually active, young and educated self-reported heterosexual men, who are more likely to be pnMSM. Increased counselling and testing services and specific HIV prevention services should be provided to this group.